Orthogonal multimodality integration and clustering in single-cell data

Multimodal integration combines information from different sources or modalities to gain a more comprehensive understanding of a phenomenon. The challenges in multi-omics data analysis lie in the complexity, high dimensionality, and heterogeneity of the data, which demand sophisticated computational tools and visualization methods for proper interpretation. In this paper, we propose a novel method, termed Orthogonal Multimodality Integration and Clustering (OMIC), for analyzing CITE-seq data. Our approach enables researchers to integrate multiple sources of information while accounting for the dependence among them. We demonstrate its effectiveness for cell clustering on CITE-seq data sets. Our results show that OMIC outperforms existing methods in terms of accuracy, computational efficiency, and interpretability. We conclude that the proposed OMIC method provides a powerful tool for multimodal data analysis that greatly improves the feasibility and reliability of integrated data.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-024-05773-y.


S1.3 Human Peripheral Blood Mononuclear cells (PBMCs) dataset
In a CITE-seq experiment, Hao et al. [3] reported the expression levels of 228 ADTs and 33,538 RNAs measured in 161,764 peripheral blood mononuclear cells (PBMCs). The cell samples were collected from a cohort of eight volunteers, aged between 20 and 49 years, who were participating in an HIV vaccine trial.

S1.4 Human Tonsil Spatial CITE-seq dataset
In a CITE-seq experiment, Liu et al. [4] performed spatial CITE-seq for high-plex protein and whole-transcriptome co-mapping. In this experiment, 283 ADTs and 28,417 genes were profiled in human tissue over a 2.5 mm × 2.5 mm region of interest. A total of 2,492 spots, each 25 µm in size, were extracted from the region of interest.

S2 Interpretation on CBMCs dataset
We conducted logistic regressions independently for each cluster obtained by the OMIC method on the CBMCs dataset, following a process similar to that described in the Interpretability section. The AUC values exceeded 0.9 on the training set and were greater than 0.85 for all clusters on the testing set.
Examining the coefficient plot for the different clusters reveals which markers are important for identifying distinct cell populations. For instance, in the CD16+ Mono cluster, the ADTs CD14 and CD16 have large positive coefficients, strongly suggesting that they are key cell markers. Similarly, in the CD14+ Mono cluster, CD14 has a large positive coefficient while the coefficient for CD16 is negative, underscoring the utility of these two markers in distinguishing this cluster [5].
In the CD8 T cell group, the ADT CD8 emerges as a critical cell marker, further highlighting the biological relevance of these markers in characterizing specific cell populations [6].
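
The per-cluster regressions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes one-vs-rest binary logistic regression fitted by plain gradient descent, uses a synthetic toy dataset in which each cluster has one elevated marker, and all function names (fit_logistic, auc, one_vs_rest_auc, make_toy) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a binary logistic regression by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_scores(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(X_train, c_train, X_test, c_test):
    """Fit one regression per cluster ID (cluster vs. all other cells)."""
    results = {}
    for k in sorted(set(c_train)):
        y_tr = [1 if c == k else 0 for c in c_train]
        y_te = [1 if c == k else 0 for c in c_test]
        w, b = fit_logistic(X_train, y_tr)
        results[k] = {
            "coef": w,
            "train_auc": auc(y_tr, predict_scores(X_train, w, b)),
            "test_auc": auc(y_te, predict_scores(X_test, w, b)),
        }
    return results


def make_toy(seed, n_per_cluster=50, n_features=3):
    # Synthetic data (assumption): cluster k has feature k elevated, others near zero.
    rng = random.Random(seed)
    X, c = [], []
    for k in range(n_features):
        for _ in range(n_per_cluster):
            X.append([rng.gauss(3.0 if j == k else 0.0, 0.5) for j in range(n_features)])
            c.append(k)
    return X, c


X_train, c_train = make_toy(seed=0)
X_test, c_test = make_toy(seed=1)
results = one_vs_rest_auc(X_train, c_train, X_test, c_test)
for k, r in results.items():
    print(k, round(r["train_auc"], 3), round(r["test_auc"], 3))
```

In this toy setting, the coefficient of the elevated feature is positive in the regression for its own cluster, which mirrors how a large positive coefficient is read as evidence for a marker in the text.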

S3 Interpretation on PBMCs dataset
For the PBMCs dataset, we perform logistic regressions in which the responses are the cluster IDs identified by the OMIC method. The results demonstrate excellent predictive power, as the AUC values of the model are larger than 0.9 for both the training and testing datasets. As an illustration, we report the top 20 significant predictors for three clusters, annotated as Memory B cells, Monocytes, and NK cells (Figure S4). For the Memory B cell cluster, we observe a positive coefficient for the ADT CD72 alongside negative coefficients for the ADTs CD45RA and CD27, all of which play pivotal roles in distinguishing the Memory B cluster [7,8]. The positive coefficients of the ADT CD64 and RNA-LYZ and the negative coefficient of CD45-2 support the Monocytes cluster: CD64 is a cell surface marker primarily expressed on cells of the myeloid lineage, including monocytes [9]. Further, the significant negative coefficient of CD45 indicates that the cell is not a lymphocyte, which supports the possibility of it being a monocyte [10]. Finally, the positive coefficient of RNA-LYZ suggests involvement in antimicrobial activity, which is consistent with the behavior of monocytes, particularly in their role as phagocytes [11]. Positive coefficients of the ADT CD16 and the RNA NKG7 are consistent with the NK cell cluster [12,13].
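
Selecting the top predictors of a cluster from a fitted regression can be sketched as below. This is only an illustration: it assumes that predictors are ranked by the absolute value of their coefficient, which may differ from the significance criterion used in the paper, and the function name, feature names, and coefficient values are hypothetical placeholders rather than results.

```python
def top_k_predictors(names, coefs, k=20):
    """Return the k predictors with the largest absolute coefficient, with their sign."""
    ranked = sorted(zip(names, coefs), key=lambda t: abs(t[1]), reverse=True)
    out = []
    for name, coef in ranked[:k]:
        sign = "positive" if coef > 0 else "negative" if coef < 0 else "zero"
        out.append((name, coef, sign))
    return out


# Placeholder feature names and coefficients (made up for illustration only).
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
coefs = [0.2, -1.5, 0.7, 0.0]
for name, coef, sign in top_k_predictors(names, coefs, k=2):
    print(name, coef, sign)
```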

S4 Additional clustering result
In this section, we present more detailed clustering results for the different methods on the HBMCs and CBMCs datasets. The number of clusters is fixed at 10, 12, 14, 16 for CBMCs in Table S1, and 13, 15, 17, 19, 21, 23 for HBMCs in Table S2.
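
For readers unfamiliar with the ARI reported in Tables S1 and S2, the following minimal sketch computes the adjusted Rand index between a reference annotation and a clustering result from the pair-counting formula. The function name and the toy label vectors are assumptions for illustration and are not taken from the paper's data or code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for trivial inputs (assumption)
    # Pairs of cells placed together in both labelings, in truth only, in prediction only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single cluster each)
    return (sum_cells - expected) / (max_index - expected)


# Toy labels: the ARI ignores how clusters are named, only the grouping matters.
truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same grouping, different names
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # grouping disagrees with truth
```

Because the index is corrected for chance, it equals 1 for identical partitions and can be negative when the agreement is worse than expected at random, which makes it comparable across different fixed cluster numbers.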

Fig. S1 The pie chart shows the percentage of each cell type relative to the entire cell population in the CBMCs CITE-seq dataset.

Fig. S2 The pie chart shows the percentage of five coarse cell types relative to the entire cell population in the HBMCs CITE-seq dataset. The five coarse cell types are T cell, B cell, NK cell, progenitor cell, and Mono cell. These cell types are further divided into 27 cell types: CD14 Mono cell, CD16 Mono cell, CD4 Memory cell, CD4 Naive cell, CD56 bright NK, CD8 Effector 1, CD8 Effector 2, CD8 Memory 1 cell, CD8 Memory 2 cell, CD8 Naive cell, cDC2 cell, gdT cell, GMP cell, HSC cell, LMPP cell, MAIT cell, Memory B cell, Naive B cell, NK cell, pDC cell, Plasmablast cell, Prog B 1 cell, Prog B 2 cell, Prog DC cell, Prog Mk cell, Prog RBC cell, and Treg cell.

Fig. S3 Coefficients of logistic regression in the training set of CD14+ Mono, CD16+ Mono, and CD8 T using integrated RNA and ADT information as predictors. The size of each dot corresponds to the absolute value of its coefficient, while the color of the dot indicates the sign (positive or negative) of the coefficient.

Fig. S4 Top 20 coefficients in the logistic regression on the training set of CD14+ Mono, CD16+ Mono, and CD8 T using integrated RNA and ADT information as predictors. The size of each dot corresponds to the absolute value of its coefficient, while the color of the dot indicates the sign (positive or negative) of the coefficient.

Table S1 Comparison of the ARI value for different methods when cluster numbers are fixed at 8, 10, 12, 14, and 16 in the CBMCs dataset.

Table S2 Comparison of the ARI value for different methods when cluster numbers are fixed at 13, 15, 17, 19, and 21 in the HBMCs dataset.